Consensus generation and variant detection by Celera Assembler

Authors

  • Gennady Denisov
  • Brian Walenz
  • Aaron L. Halpern
  • Jason R. Miller
  • Nelson Axelrod
  • Samuel Levy
  • Granger G. Sutton
Abstract

MOTIVATION: We present an algorithm to identify allelic variation given a Whole Genome Shotgun (WGS) assembly of haploid sequences, and to produce a set of haploid consensus sequences rather than a single consensus sequence. Existing WGS assemblers take a column-by-column approach to consensus generation and produce a single consensus sequence, which can be inconsistent with the underlying haploid alleles and may not match any of the aligned sequence reads. Our new algorithm uses a dynamic windowing approach. It detects alleles by simultaneously processing the portions of aligned reads spanning a region of sequence variation, assigns reads to their respective alleles, phases adjacent variant alleles and generates a consensus sequence corresponding to each confirmed allele. This algorithm was used to produce the first diploid genome sequence of an individual human. It can also be applied to assemblies of multiple diploid individuals and hybrid assemblies of multiple haploid organisms.

RESULTS: Applied to the individual human genome assembly, the new algorithm detects exactly two confirmed alleles and reports two consensus sequences in 98.98% of the 2,033,311 detected regions of sequence variation. In 33,269 of the 460,373 detected regions longer than 1 bp, it fixes the errors that the original Celera Assembler consensus algorithm introduced by constructing a mosaic haploid representation of a diploid locus. Using an optimized procedure calibrated against 1,506,344 known SNPs, it detects 438,814 new heterozygous SNPs with a false positive rate of 12%.

AVAILABILITY: The open source code is available at: http://wgs-assembler.cvs.sourceforge.net/wgs-assembler/
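The contrast between column-by-column and window-based consensus can be illustrated with a toy example. The sketch below is not the Celera Assembler implementation: the function names, the `min_support` threshold, the Hamming-distance assignment rule and the six two-base reads are all assumptions made for illustration. It only shows how a per-column majority vote can yield a mosaic that matches neither allele, while treating each read's window as a unit recovers both.

```python
from collections import Counter

def column_consensus(reads):
    """Majority base in each column, taken independently of other columns."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

def confirmed_alleles(reads, min_support=2):
    """Treat each read's whole window as a unit; keep strings seen often enough.
    min_support is a hypothetical confirmation threshold."""
    counts = Counter(reads)
    return [seq for seq, n in counts.most_common() if n >= min_support]

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def assign_reads(reads, alleles):
    """Assign every read to the confirmed allele it differs from least."""
    groups = {allele: [] for allele in alleles}
    for read in reads:
        nearest = min(alleles, key=lambda allele: hamming(read, allele))
        groups[nearest].append(read)
    return groups

# Hypothetical window: two alleles (AC, GT) plus two reads with one error each.
reads = ["AC", "AC", "GT", "GT", "AG", "CT"]

mosaic = column_consensus(reads)
alleles = confirmed_alleles(reads)
groups = assign_reads(reads, alleles)
per_allele = {a: column_consensus(g) for a, g in groups.items()}
```

Here `mosaic` is `"AT"`, a sequence that no read carries, whereas `per_allele` holds one consensus for each of the two confirmed alleles. Phasing of adjacent windows, which the abstract also mentions, is left out of this sketch.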

Related resources

Assembly algorithms for next-generation sequencing data.

The emergence of next-generation sequencing platforms led to resurgence of research in whole-genome shotgun assembly algorithms and software. DNA sequencing data from the Roche 454, Illumina/Solexa, and ABI SOLiD platforms typically present shorter read lengths, higher coverage, and different error profiles compared with Sanger sequencing data. Since 2005, several assembly software packages hav...

Fragment assembly with double-barreled data

For the last twenty years fragment assembly was dominated by the "overlap - layout - consensus" algorithms that are used in all currently available assembly tools. However, the limits of these algorithms are being tested in the era of genomic sequencing and it is not clear whether they are the best choice for large-scale assemblies. Although the "overlap - layout - consensus" approach proved to...

Design of a compartmentalized shotgun assembler for the human genome

Two different strategies for determining the human genome are currently being pursued: one is the "clone-by-clone" approach, employed by the publicly funded project, and the other is the "whole genome shotgun assembler" approach, favored by researchers at Celera Genomics. An interim strategy employed at Celera, called compartmentalized shotgun assembly, makes use of preliminary data produced by...

Sequencing the Bonobo Genome

The Bonobo Genome Consortium generated DNA sequencing reads representing the genome of a single bonobo individual. The data consisted of almost 270 million fragment sequences generated on FLX machines from 454 Life Sciences. The fragments derived from FLX standard and Titanium chemistries, and from paired and unpaired protocols. The data was assembled at the J. Craig Venter Institute with the o...

Additional file 7. Overview of the Sprai algorithm and its performance

The detailed algorithm of Sprai will be published elsewhere, but we give an overview of it in this study. Sprai is primarily designed for correcting sequencing errors in single-molecule sequencing reads, and therefore can be integrated with any analysis tools that accept long reads of high accuracy. Among numerous kinds of genome analysis, de novo genome assembly is one of the most common analy...

Journal:
  • Bioinformatics

Volume 24, Issue 8

Published 2008